Project description

As part of Brainnest’s Market Research training, we were given real customer data from Yondo, a service for hosting and selling audiovisual content. We were tasked with finding trends in Yondo’s data that could help us build a Customer Persona. Ideally, we would supplement the available information with behavioral insights obtained through survey research. For the scope of this project, segmentation is based on demographic and economic data. I chose a k-modes clustering algorithm to give the qualitative persona build some quantitative rigor. I then filtered the data set for customers within the largest segment and built a buyer persona based on their social media presence.

The end result is a Customer Persona Yondo can use to tailor their Marketing efforts.

Exploratory Data Analysis


As we import our data and take a first look, we can see that we are dealing with many variables and a lot of missing values.

library(dplyr) # provides %>%, mutate(), count() and select() used below

df <- read.csv("/Users/matiasfonolla/Desktop/Market Research - Training /Market Research/Yondo.xlsx - QUERY_FOR_YONDOFINAL.csv",
               na.strings = "")

# We have 60 variables and 284 observations
dim(df)
## [1] 284  60
# Names of our variables
names(df)
##  [1] "StoreName"                        "Age"                             
##  [3] "Gender"                           "CountryName"                     
##  [5] "City"                             "State"                           
##  [7] "TimeZone"                         "SignUpType"                      
##  [9] "SubscriptionStatus"               "PlanName"                        
## [11] "Industry"                         "AlreadySelling"                  
## [13] "Revenue"                          "CreatedDate"                     
## [15] "subdomain"                        "StoreUrl"                        
## [17] "dasboard.link"                    "F20"                             
## [19] "Is.it.connected.to.their.website" "Are.they.active"                 
## [21] "Notes"                            "Household.Income"                
## [23] "Marital.Status"                   "Home.Owner.Status"               
## [25] "Length.of.Residence"              "Education"                       
## [27] "Occupation"                       "Health...Wellness"               
## [29] "Travel"                           "Auto.Parts"                      
## [31] "Kids...Babies"                    "Nutrition"                       
## [33] "Home...Garden"                    "Garden...Patio"                  
## [35] "Garden.Supplies"                  "Home.Decor"                      
## [37] "Home.Improvement"                 "Kitchen...Dining"                
## [39] "Pets...Supplies"                  "Gift.Buyer"                      
## [41] "Toys"                             "Sports...Outdoors"               
## [43] "Beauty"                           "Mens.Clothing"                   
## [45] "Shoes"                            "Womens.Clothing"                 
## [47] "Jewelry"                          "Electronics"                     
## [49] "Computers...Software"             "Home.Buyer"                      
## [51] "Cord.Cutter"                      "Deal.Seeker"                     
## [53] "Luxury.Shopper"                   "Big.Spender"                     
## [55] "Online.Buyer"                     "PostalCode"                      
## [57] "entertainment.sites"              "Communities"                     
## [59] "Groups"                           "Influencers"
# Every single one of our observations has at least one missing value
df %>%
  count(!complete.cases(.))
##   !complete.cases(.)   n
## 1               TRUE 284
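
Knowing that every row is incomplete does not tell us where the gaps are. One way to look closer is to count missing values per column. The sketch below uses a small invented data frame (`toy`) as a stand-in for the Yondo data; the same calls should work on `df`.

```r
# Toy stand-in for the Yondo data (values are invented for illustration)
toy <- data.frame(
  Age    = c("25-34", NA, "35-44", NA),
  Gender = c("Female", "Male", NA, "Female"),
  Notes  = c(NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Share of rows missing per column, from most to least incomplete
na_share <- sort(na_per_column / nrow(toy), decreasing = TRUE)
print(na_share)
```

Columns near the top of `na_share` are candidates to drop before clustering.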


Many of our variables are incomplete and don’t offer clear insights into potential customer segments. Let’s recode the relevant ones as factors and take a deeper look.

df <- df %>%
  mutate(Age = as.factor(Age),
         Gender = as.factor(Gender),
         Industry = as.factor(Industry),
         CountryName = as.factor(CountryName),
         SignUpType = as.factor(SignUpType),
         SubscriptionStatus = as.factor(SubscriptionStatus),
         PlanName = as.factor(PlanName),
         AlreadySelling = as.factor(AlreadySelling),
         Revenue = as.factor(Revenue),
         Are.they.active = as.factor(Are.they.active))

We will recode overlapping categories in key variables.

# Re-coding overlapping age categories

unique(df$Age)
##  [1] 25-34 45-54 <NA>  35-44 65+   55-64 54-65 34-45 44-55 45-55
## Levels: 25-34 34-45 35-44 44-55 45-54 45-55 54-65 55-64 65+
# Map each overlapping label onto its standard bracket
df$Age <- replace(df$Age, df$Age == "45-55", "45-54")
df$Age <- replace(df$Age, df$Age == "44-55", "45-54")
df$Age <- replace(df$Age, df$Age == "34-45", "35-44")
df$Age <- replace(df$Age, df$Age == "54-65", "55-64")


# Re-coding wrong PlanName and CountryName as NA 

df$PlanName <- replace(df$PlanName, df$PlanName == "/", NA)
df$CountryName <- replace(df$CountryName, df$CountryName == "0", NA)
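
An alternative to repeated `replace()` calls is to merge factor levels through a lookup table. The sketch below is only an illustration on an invented factor (`age`); the name `recode_map` is hypothetical. Unlike `replace()`, this approach also removes the stale levels from the factor.

```r
# Toy factor with the same kind of overlapping labels (values invented)
age <- factor(c("45-55", "45-54", "34-45", "35-44", "54-65", NA))

# Hypothetical lookup: messy label -> standard label
recode_map <- c("45-55" = "45-54", "44-55" = "45-54",
                "34-45" = "35-44", "54-65" = "55-64")

# Assigning duplicate names to levels() merges those levels
lv <- levels(age)
hit <- lv %in% names(recode_map)
lv[hit] <- recode_map[lv[hit]]
levels(age) <- lv

print(table(age, useNA = "ifany"))
```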

Let’s summarize our most relevant variables.

df %>%
  select(where(is.factor))%>%
  summary.data.frame()
##       Age         Gender                      CountryName           SignUpType 
##  35-44  : 34   Female:146   United States of America:184   New Signup    :274  
##  45-54  : 31   Male  :105   United Kingdom          : 19   Weebly        :  7  
##  25-34  : 17   NA's  : 33   Australia               : 18   Weebly Upgrade:  3  
##  55-64  : 14                Canada                  : 15                       
##  65+    : 10                Switzerland             :  6                       
##  (Other):  0                (Other)                 : 41                       
##  NA's   :178                NA's                    :  1                       
##  SubscriptionStatus              PlanName  
##  active:284         Starter          :150  
##                     Professional     : 68  
##                     Starter Plus     : 47  
##                     Webinar - Starter:  7  
##                     Trial            :  6  
##                     (Other)          :  5  
##                     NA's             :  1  
##                             Industry  
##  Arts & Crafts                  :  5  
##  Consulting                     : 53  
##  Fitness                        :126  
##  Medical                        : 28  
##  Other                          : 51  
##  Tutoring (Languages, Math etc.): 14  
##  NA's                           :  7  
##                                               AlreadySelling     Revenue   
##  I am already selling online using a different system: 71    0       :116  
##  I haven't yet started selling                       :135    0-5k    : 32  
##  I'm already selling, just not online                : 53    1m+     : 10  
##  I'm just playing around                             : 18    250k-1m : 25  
##  NA's                                                :  7    50k-250k: 43  
##                                                              5k-50k  : 51  
##                                                              NA's    :  7  
##  Are.they.active
##  No  : 81       
##  Yes :201       
##  NA's:  2       
##                 
##                 
##                 
## 

We already have some valuable insights! The most common age groups (besides NA, which we should be mindful of) are 35-44 and 45-54. The sample seems to be mostly American and female. The most popular industry is fitness, and most accounts have not yet made online sales, with zero revenue being the most common revenue bracket, despite being active.


Let’s confirm some of these intuitions graphically
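
A grouped bar chart is one simple way to do this. The sketch below uses base R on an invented data frame (`toy`) and is not the code behind the original figures; swapping in `df$Gender` and `df$Industry` should give a comparable view.

```r
# Toy stand-in for the Yondo data (values are invented for illustration)
toy <- data.frame(
  Gender   = c("Female", "Female", "Male", "Female", "Male"),
  Industry = c("Fitness", "Fitness", "Consulting", "Medical", "Fitness"),
  stringsAsFactors = FALSE
)

# Cross-tabulate gender (rows) by industry (columns); NAs are dropped by default
counts <- table(toy$Gender, toy$Industry)

# One group of bars per industry, one bar per gender
barplot(counts, beside = TRUE, legend.text = rownames(counts),
        xlab = "Industry", ylab = "Customers",
        main = "Customers by industry and gender")
```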





It seems that a majority of the sample is in fact made up of women in the fitness industry. They have an active account but have not yet started selling online or made any revenue from their business.

We will now seek to confirm this intuition statistically through K-modes clustering.


Why K-modes over K-means clustering?

K-means clusters continuous data using a numeric distance (typically Euclidean) between each data point and the mean of its cluster. The smaller the distance, the more similar our data points are. However, such distances are not truly meaningful for categorical data (the distance between our ‘0’ and ‘1’ dummy variables is always 1).

In K-modes, the data is represented as a set of categorical variables, and the algorithm attempts to partition the data into k clusters based on the modes (most frequent values) of the categorical variables in each cluster. Dissimilarity is measured by simple matching, that is, the number of variables on which two records disagree. In other words, K-modes defines clusters based on the similarities of the categorical variable patterns in the data (Goyal & Agarwal, 2017).
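
To make the idea concrete, here is a minimal sketch of the two building blocks, assuming simple-matching distance and per-column modes. The helper names `simple_matching` and `cat_mode` and the `toy` data are my own illustrations, not part of klaR.

```r
# Simple-matching distance: number of attributes on which two records differ
simple_matching <- function(a, b) sum(a != b)

# Most frequent value of a categorical vector
# (ties resolve to the alphabetically first value)
cat_mode <- function(x) names(which.max(table(x)))

# Toy customers (invented values)
toy <- data.frame(
  Gender   = c("Female", "Female", "Male"),
  Industry = c("Fitness", "Fitness", "Consulting"),
  Plan     = c("Starter", "Professional", "Starter"),
  stringsAsFactors = FALSE
)

# A cluster "centre" in k-modes is the vector of per-column modes
centre <- vapply(toy, cat_mode, character(1))

# Distance from each customer to that centre
dists <- apply(toy, 1, simple_matching, b = centre)
print(centre)
print(dists)
```

A customer that matches the centre on every variable has distance 0, and each mismatching variable adds 1.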

Cluster Analysis

We will only use our most relevant variables so as not to add unnecessary noise to our results.

condensed_df <- df%>%
  select(Age, Gender, CountryName, 
         PlanName, Industry, AlreadySelling, Revenue, Are.they.active)


K-modes clustering will not work if we feed it NA values. We will work around this by turning them into the string "NA", so that missingness is treated as a category of its own.

library(gtools)
dfwNA <- condensed_df %>%
  mutate_if(is.factor, as.character)

dfwNA <- gtools::na.replace(dfwNA, replace = "NA")
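
If gtools is not available, the same conversion can be sketched in base R. The example below runs on an invented data frame (`toy`), not on the Yondo data.

```r
# Toy factors with missing values (invented for illustration)
toy <- data.frame(
  Age    = factor(c("25-34", NA)),
  Gender = factor(c(NA, "Male"))
)

# Convert every column to character, keeping the data frame structure
toy[] <- lapply(toy, as.character)

# Replace each missing value with the literal string "NA"
toy[is.na(toy)] <- "NA"
print(toy)
```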


Clustering algorithms like K-means and K-modes need a pre-specified number of clusters to run. We’ll estimate the ideal number by gradually increasing the number of clusters (modes) in a loop and comparing their fit.

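library(klaR)
set.seed(1222)

# Total within-cluster simple-matching distance for k = 1..10
Es <- numeric(10)
for (i in 1:10) {
  kpres <- kmodes(dfwNA, modes = i, iter.max = 15, fast = TRUE)
  # withindiff holds one distance per cluster, so sum it to get one value per k
  Es[i] <- sum(kpres$withindiff)
}
plot(1:10, Es, type = "b", ylab = "Within Cluster distance", xlab = "# Clusters",
     main = "Scree Plot") # figure 2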

The lower the within-cluster simple-matching distance (y-axis), the more compact and similar the data points within each cluster are. This distance generally shrinks as clusters are added, so we look for the point where the curve flattens. At 4 clusters, adding more yields only small improvements, so 4 gives us compact clusters with the fewest clusters. This is known as the ‘elbow’ method.
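
Reading the elbow off a plot is subjective, so it can help to tabulate how much each extra cluster improves the fit. The numbers in `Es_toy` below are invented purely to illustrate the calculation and are not results from the Yondo data.

```r
# Hypothetical total within-cluster distances for k = 1..7 (invented numbers)
Es_toy <- c(900, 700, 560, 450, 430, 415, 405)

# Improvement gained by moving from k - 1 to k clusters
gain <- -diff(Es_toy)
elbow_table <- data.frame(k = 2:length(Es_toy), gain = gain)
print(elbow_table)
```

In this made-up series the gain drops sharply after k = 4, which is the pattern the elbow method looks for.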

We can now run our algorithm, specifying that it should return 4 clusters.

Running and exploring the K-modes algorithm

mode_clusterswNA <- kmodes(dfwNA, modes = 4, iter.max = 15, fast = TRUE)

clusterswNA <- mode_clusterswNA$modes

clusterswNA <- clusterswNA %>%
  mutate(Size = mode_clusterswNA$size)

Clusters with NA values

| Age | Gender | CountryName | PlanName | Industry | AlreadySelling | Revenue | Are.they.active | Size |
|---|---|---|---|---|---|---|---|---|
| NA | Female | United States of America | Professional | Fitness | I am already selling online using a different system | 50k-250k | Yes | 64 |
| NA | Male | United States of America | Starter | Fitness | I haven’t yet started selling | 0 | Yes | 58 |
| NA | Female | United States of America | Starter | Fitness | I haven’t yet started selling | 0 | Yes | 103 |
| NA | Male | United States of America | Starter Plus | Other | I haven’t yet started selling | 0 | No | 59 |



We can see that the largest cluster (103 customers) is made up of American women in the Fitness industry. They have active Starter accounts but have not yet started selling their content online. Although we don’t have information about their age, this seems to confirm our intuitions about the characteristics of Yondo’s largest customer segment.

Creating a Persona

Our original Excel file contained links to customers’ social media. Filtering for customers who belong to our cluster of choice, I built a Persona based on this segment’s profile as reflected by their social media presence.
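
The filtering step can be sketched as follows. The data frame `toy` and its cluster labels are invented; with klaR, the labels would come from the `cluster` element of the fitted `kmodes` object, which holds one cluster number per row.

```r
# Toy customer table (invented values)
toy <- data.frame(
  StoreName = c("store_a", "store_b", "store_c", "store_d"),
  Gender    = c("Female", "Male", "Female", "Female"),
  stringsAsFactors = FALSE
)

# Hypothetical cluster labels, one per row
toy$cluster <- c(3, 2, 3, 1)

# Size of each cluster, then the label of the largest one
sizes <- table(toy$cluster)
largest <- as.integer(names(sizes)[which.max(sizes)])

# Keep only the customers in the largest cluster
segment <- toy[toy$cluster == largest, ]
print(segment)
```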

Next steps would involve validating this persona through focus groups, but for now, meet Jill!